\section{Introduction}

We consider the problem of performing SQL-style join operations on tables that are secret shared among three parties, in the presence of an honest majority. Our proposed protocol takes two or more arbitrarily secret shared database tables and constructs another secret shared table containing a \texttt{join} of the two tables, without revealing \emph{any} information beyond the secret shares themselves. Our protocol is constant-round and  has computation and communication overhead that is linear in the size of the tables. Simulation-based security is achieved in the semi-honest setting with an honest-majority. Our protocol can perform inner, left and full joins along with union and arbitrary computation on the resulting table, with the best performance and security guarantees when joins operate on unique primary keys.

New techniques \cite{usenix:PSZ14,USENIX:PSSZ15,PSZ16,CCS:KKRT16,PSWW18,CLR17,CHLR18,cryptoeprint:2017:738,RA17,KLSAP17,OOS17,KMPRT17} for performing set intersection, inner join and related functionalities have shown great promise for practical deployment. 
The vast majority of these works perform {\em private set intersection} (PSI), which is analogous to revealing the entire result of an inner join.
Computing a join {\em without revealing it} (\ie, performing further joins, filtering, or computing only aggregate information) is significantly harder, and optimizations for such a setting do not automatically translate to our composable setting.
We highlight a few notable results that hide the contents of a join (revealing only some function of it):

Ion et al.\ recently deployed a private set intersection sum protocol\cite{cryptoeprint:2017:738} to allow customers of Google Adwords to correlate online advertising with offline sales, while preserving user privacy. Pinkas et al. \cite{PSWW18} also introduced a practical protocol that can compute any (symmetric) function of the intersection and associated data. These protocols can be framed in terms of SQL queries consisting of an inner join followed by an aggregation on the resulting table, e.g. summing a column.  
Neither of these protocols (and almost no prior related results) support secret-shared inputs, but rather require the source tables to be held in the clear by each party.

The majority of these protocols consider the two party setting and are based on various cryptographic primitives, e.g. exponentiation~\cite{cryptoeprint:2017:738}, oblivious transfer\cite{PSWW18}, or fully homomorphic encryption\cite{CLR17}. However, in this work we alter the security model to consider three parties with an honest majority. The motivation is that typical protocols in this setting (e.g.\cite{highthroughput}) require less computation and communication than similar two party protocols, by a factor of at least the security parameter $\kappa=128$. Moreover, we will see that the honest majority enables various algorithms which are orders of magnitude more efficient, e.g. oblivious permutations require $O(n)$ work instead of $O(n\log n \kappa)$ \cite{MS13}. 

Given this observation we investigate how to leverage the efficiency gains in the three party setting to construct practical protocols for performing set intersection and other SQL-like operations where both the inputs and outputs are secret shared. One critical aspect of this input/output requirement is that join operations can then be \emph{composed} together, where the output of a join can be the input to another. Allowing this composability greatly increases the ability to perform highly complex queries and enables external parties to contribute data simply by secret sharing it between the primary parties which participate in the protocol.

The fact that we support secret-shared inputs also leads to \emph{outsourced secure computation.} Three non-coluding servers can be established, and inputs can be provided by the servers themselves or by external parties simply by secret sharing the input among the servers. This model has been gaining traction in industry. Mozilla recently deployed a service to collect telemetry data about Firefox\cite{mozilla} using two non-colluding servers running the Prio protocol\cite{DBLP:conf/nsdi/Corrigan-GibbsB17}. Other examples include privacy preserving machine learning frameworks which support secret shared inputs such as Facebook's Crypten built for PyTorch\cite{Crypten} and Cape Privacy's FTEncrypted built for TensorFlow\cite{TFEncrypted}. 

Complex joins on secret-shared inputs/outputs are valuable in all of these examples.
Most privacy preserving machine learning publications\cite{aby3,Crypten,TFEncrypted,SecureNN,XONN} assume that the data being trained on has already been joined together. However, without a framework like ours it is unclear how this would be accomplished while preserving privacy and efficiency. The generality of our approach allows us to solve these problems and many others, e.g. the two applications presented in \sectionref{sec:app}.


\subsection{Functionality}

Our protocol offers a wide variety of functionality including set intersection, set union, set difference and a variety of SQL-like joins with complex boolean queries. Generally speaking, our protocol works on tables of secret shared data which are functionally similar to SQL tables. This is in contrast to traditional PSI and PSU protocols\cite{usenix:PSZ14,USENIX:PSSZ15,PSZ16,CCS:KKRT16} in that each record is now a tuple of values as opposed to a single key. 

We define our database tables in the natural way. Each table can be viewed as a collection of rows or as a vector of columns. For a table $X$, we denote the $i$th row as $X[i]$ and the $j$th column as $X_j$. 
\iffullversion
Note that each column of a table has the same length but can contain different data types, e.g. $X_1$ is a column of 32 bit fixed point decimal values and $X_2$ is a column of 1024 bit strings.

\fi
Our core protocol requires each table to contain unique values in the column defining the join  (i.e., we can only join on ``unique primary keys''). For example, if we consider the following SQL styled join/intersection query
\vspace{-0.1cm}
$$
\texttt{select } X_2 \texttt{ from } X \texttt{ inner join } Y \texttt{ on } X_1 = Y_1
\vspace{-0.1cm}
$$
then the join-keys are $X_1$ and $Y_1$. This uniqueness condition can be extended to the setting where multiple columns are being compared for equality. 
\iffullversion
In this case the tuple of these columns must be unique. 
\fi
Later on we will discuss the case when such a uniqueness property does not hold. Our protocols also support a \texttt{where} clause that filters the selection using an arbitrary predicate of the $X$ and $Y$ rows. Furthermore, the \texttt{select} clause can also return a function of the two rows. For example,
\vspace{-0.1cm}
\begin{align*}
\texttt{select }&  X_1,max(X_2, Y_2)  \texttt{ from } X \texttt{ inner join } Y \\
\texttt{on }& X_1 = Y_1 \texttt{ where } Y_2 > 23.3
\vspace{-0.1cm}
\end{align*}
In general, the supported join operations can be characterized in three parts: 1) The select function $S(\cdot)$ that defines how the rows of $X,Y$ are used to construct each output row, e.g. $S(X,Y)=(X_1, max(X_2,Y_2))$
\iffullversion
in the example above
\fi. 2) The predicate $P(\cdot)$ that defines the \texttt{where} clause, and 3) which columns are being joined on.
\iffullversion
 We require that both $S$ and $P$ must be expressible in the framework of \cite{aby3} which we provide a custom implementation of.
\fi

\iffullversion
So far we have discussed inner join (intersection) between two tables. 
\fi
Several other types of joins are also supported including left and right joins, set union and set minus (difference) and full joins. A left join takes the inner join and includes all of the missing records from the left table. For the records solely from the left table, the resulting table contains \texttt{NULL} for the columns from the right table. Right join is defined symmetrically. A full join is a natural extension where all the missing rows from $X$ and $Y$ are added to the output table.

We define the union of two tables to contain all records from the left table, along with all the records from the right table which are not in the intersection with respect to the join-keys. Note that this definition is not strictly symmetric with respect to the left and right tables due to rows in the intersection coming from the left table. %\footnote{We define union in this way to resolve the ambiguity around which table the records in the intersection should come from. For example, consider two tables each with one row $X[0]=(k, x_2,x_3,...), Y[0]=(k, y_2,y_3,...)$. The union of $X,Y$ with $X_1,Y_1$ being the join-keys should only contain one row $(k,...)$ but which values $x_2,x_3,...$ or $y_2,y_3,...$ that should be in the output table is ambiguous. We therefore define it to be $(k,x_2,x_3,...)$. A full join allows both sets of values to be in the output.}. 
Table minus is similarly defined as all of the left table whose join-column value is not present in the right table. 

Beyond these various join operations, our framework supports two broad classes of operations which are a function of a single table. The first is a general SQL select statement which can perform computation on each row (e.g. compute the max of two columns) and filter the results using a \texttt{where} clause predicate. The second class is referred to as an aggregation which performs an operation across all of the rows of a table --- for example, computing the sum, counts, or the max of a given column. 
\subsection{Our Results}

We present the first practical secure multi-party computation protocol for performing SQL-style database joins with linear overhead and constant rounds. Our protocol is fully composable in that the input and output tables are generically secret shared between the parties. 
\iffullversion
	No partial information is revealed at any point. 
\fi
	We achieve this result by combining various techniques from private set intersection and secure computation more 
\iffullversion 
	broadly while requiring the joined on keys be unique.
\else
	broadly.
\fi
 We build on the the binary secret sharing technique of \cite{highthroughput} with enhancements described by \cite{aby3}. We then combine this secret sharing scheme with cuckoo hashing\cite{usenix:PSZ14}, an MPC friendly PRF\cite{lowmc} and a custom protocol for evaluating an oblivious switching network\cite{MS13}. Using these building blocks our protocol is capable of computing the intersection of two tables of $n=2^{20}$ rows in 4.9 seconds. 
\iffullversion 
	Alternatively, the cardinality of the intersection can be computed in just 3.1 seconds.
\fi
  Beyond these two specific functionalities, our protocol allows arbitrary computation applied to a shared table. Compared to existing three party protocols with similar functionality (composable), our implementation is roughly $1000\times$ faster. When compared with \emph{non-composable} two party protocol, we observe a larger difference ranging from our protocol being $1.25\times$ slower to $4000\times$ faster depending on the functionality. 


Building on our proposed protocol we demonstrate its utility by showcasing two potential applications. The first prototype would involve running our protocol between and within the states of the United States to validate the accuracy of the voter registration data in a privacy preserving way. The Pew Charitable Trust\cite{pew} reported 1 in 8 voter registration records in the United States contains a serious error while 1 in 4 eligible citizens remain unregistered. Our privacy preserving protocol identifies when an individual's address is out of date or more seriously if someone is registered to vote in more than one state which could allow them to cast two votes. 
\iffullversion
Additionally, our protocol can help register eligible citizens. Our protocol ensures that only the minimum amount of information is revealed, namely the identities of individuals with serious registration errors.
\fi 
Due to how the data is distributed between different governmental agencies, it will be critical that our protocol allows for composable operations. We implement this application and demonstrate that it is practical to run at a national scale (250M records) and low cost.

The second application that we consider allows multiple organizations to compare computer security incidents and logs to more accurately identify unwanted activities, e.g. a bot net. Several companies already offer this service including Facebook's ThreatExchange\cite{threat} and an open source alternative\cite{alt_threat}. One of the primary limitations of these existing solutions is the requirement that each organization send their security logs to a central party, e.g. Facebook. We propose using our protocol to distribute the trust of this central party between three parties such that privacy is guaranteed so long as there is an honest majority.


\input{related}
